Exploring Financial Time Series Data

In this notebook, we explore the application of Independent Component Analysis (ICA) to the returns of the top 28 S&P 500 stocks from November 2011 to November 2016. We provide the data at https://github.com/Freshdachs/FinData .

In [9]:
import numpy as np
import pandas as pd
import datetime
import plotly.offline as py
import plotly.graph_objs as go
from sklearn.decomposition import FastICA, PCA
from itertools import islice
import functools

py.init_notebook_mode(connected=True)

# plot each labelled series against the date index
plot = lambda d,lbls: py.iplot([go.Scatter(x=d.index,y=d[i],name=i) for i in lbls])

# plot each labelled series with a vertical offset of 2 per series, for visual separation
plot_off = lambda d,lbls: py.iplot([go.Scatter(x=d.index,y=d[l]+2*i,name=l) for (i,l) in enumerate(lbls)])
In [2]:
#load timeseries
df=pd.read_csv('tss3.csv', sep=';',index_col=0,parse_dates=True,decimal=',')
#sort columns by sum
df = df[df.columns[np.argsort(df.apply(sum).values)][::-1]]

Data

We use the daily adjusted closing prices of 441 S&P 500 stocks over 2,458 days, from 2011-11-01 to 2016-11-01. We retrieved the data from Quandl and cleaned it up so that only stocks and days with complete data remain.
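The cleaning step itself is not shown in this notebook; a minimal sketch of how incomplete rows and columns can be dropped with pandas (the tickers and values below are illustrative, not the actual Quandl download):

```python
import numpy as np
import pandas as pd

# toy price panel with gaps (illustrative tickers, not the real Quandl data)
raw = pd.DataFrame(
    {"AAPL": [100.0, 101.0, np.nan, 103.0],
     "MSFT": [50.0, 51.0, 52.0, 53.0],
     "XYZ": [np.nan] * 4},
    index=pd.date_range("2011-11-01", periods=4),
)

# drop stocks with no data at all, then days with any remaining gap,
# so only (stock, day) pairs with complete data survive
clean = raw.dropna(axis=1, how="all").dropna(axis=0, how="any")
```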

In [319]:
def pre_diff(df):
    d=df.copy()
    d.values[1:] =df.values[1:]-df.values[:-1]
    return (d.drop([d.index[0]]),df[0:1])

def pre_mean(df):
    d=df.copy()
    return (d.apply(lambda i: i-np.mean(i)),d.apply(np.mean))

def pre_normalize(df):
    d=df.copy()
    fac=np.max(np.max(np.abs(d)))
    return (d.apply(lambda i: i/fac), fac)
    #fac=d.apply(lambda d: np.max(np.abs(d)))
    #fac = d.apply(np.var)
    #return (d/fac,fac.values)

def pre_process(df):
    d, base=pre_diff(df)
    d, means = pre_mean(d)
    d, fac=pre_normalize(d)
    return (d,base,means,fac)
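As a sanity check (not part of the original notebook), the differencing step is invertible: cumulatively summing the diffs and adding back the retained first row recovers the prices, which is exactly what the later reconstruction relies on. A minimal sketch on toy data, using pandas' built-in diff instead of pre_diff:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({"A": [10.0, 12.0, 11.0, 15.0]},
                      index=pd.date_range("2011-11-01", periods=4))

# first differences (what pre_diff computes), plus the first row as the base
diffs = prices.diff().dropna()
base = prices.iloc[0:1]

# invert: cumulative sum of the diffs, shifted by the base row
rebuilt = pd.concat([base, base.values + diffs.cumsum()])
```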
In [320]:
pre_df, base,means,fac = pre_process(df)
In [321]:
idcs = np.argsort(df.apply(max).values)
lbls = df.columns[:8].values

Visualization of the Top 8 Stock Returns

Below you can find a visualization of the top 8 stock returns.

In [322]:
plot(df,lbls)

Differential, Normalized Returns

Below we compare the differential, normalized returns with the regular returns.

In [323]:
plot_off(pre_df,lbls)

ICA

In this section, we apply the ICA approach to our data.
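Before applying it to the stock data, a minimal self-contained FastICA example on synthetic mixtures (the signals and mixing matrix are illustrative) shows the decompose-and-invert round trip used in this section:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# two independent non-Gaussian sources, mixed linearly
sources = np.c_[np.sign(np.sin(3 * t)), np.sin(2 * t)]
mixing = np.array([[1.0, 0.5], [0.3, 1.2]])
X = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0, max_iter=1000)
S = ica.fit_transform(X)          # estimated sources
X_rec = ica.inverse_transform(S)  # should reproduce the mixtures
```

Because the mixing is full rank and both components are kept, the inverse transform reconstructs the mixtures up to numerical error.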

In [336]:
# load ICA and fit our data
ica = FastICA(whiten=False,max_iter=1000)

#reduce datasize
d, base,means,fac = pre_process(df[df.columns[:28]])
In [337]:
# compute sources and invert-transformed signal S from dataframe d
def ica_d(d):
    ica_d = ica.fit(d)
    S = ica_d.transform(d)
    return (S, ica_d.mixing_, ica_d.inverse_transform(S))
S,A,x = ica_d(d)
In [338]:
def reconstruct(df,x,base,means,fac):
    d=df.copy()
    x = x*fac+means.values
    d.values[:] = (np.cumsum(x,0)+base.values)[:]
    return base.append(d)
In [339]:
def ranking(S,A):
     # for each stock (row of the mixing matrix A), sort the components by the
     # peak absolute contribution |S[:,j]*A[i,j]| they make to that stock
     return np.apply_along_axis(lambda row: np.argsort(np.apply_along_axis(max,0,np.abs(S*row))),1,A)
In [340]:
def indices(R,i=0,n=8):
    # split stock i's ranking into its top-n and remaining component indices
    return (R[i][-n:],R[i][:-n])

def sep_ics(S,A,R,i=0, n=8):
    # partial reconstructions from the major and the minor component sets
    return list(np.dot(S[:,f],A[:,f].T) for f in indices(R,i,n))

def to_df(d,v):
    e=d.copy()
    e.values[:]=v[:]
    return e
In [329]:
plot_off(to_df(d,S),d.columns[indices(R)[0]].values)

Top 8 Components for PCLN, ICA

In [330]:
R=ranking(S,A)
X_maj, X_min =sep_ics(S,A,R)
rec_df = reconstruct(d,X_maj,base,means,fac)
min_df = reconstruct(d,X_min,base,means,fac)
plot(rec_df,lbls)
plot(min_df,lbls)
plot(df,lbls)

Here we see that both the minor and the major components can be reconstructed to similar approximations. We see upward trends in both reconstructions; how can that be? We subtract the mean from our data during preprocessing, so even a zero signal gets the mean added back during reconstruction. Since we reconstruct the derivative (the differenced series), this addition translates to a constant slope.

This shows that we can successfully decompose the data with ICA and reconstruct it.
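The slope effect is easy to verify in isolation: cumulatively summing a constant (the re-added mean) yields a straight line, so even an all-zero signal reconstructs to a trend. A toy check with an illustrative mean return:

```python
import numpy as np

n_days, mean_return = 100, 0.05

# a reconstructed return series that is identically zero before the mean is re-added
zero_signal = np.zeros(n_days)
rebuilt_returns = zero_signal + mean_return  # mean re-added during reconstruction
prices = np.cumsum(rebuilt_returns)          # integrate the differenced series

# the result is a straight line whose slope equals the mean return
slope = prices[-1] - prices[-2]
```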

Thresholded Reconstruction from ICA

Now we take a look at the thresholded reconstruction.
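The per-element reconstruction below (array_with_g) loops in pure Python; the same computation can be sketched with NumPy broadcasting. This is an assumed equivalent, not the notebook's code: it forms every contribution S[t,j]*A[k,j], keeps or drops each one by the threshold, and reports the fraction kept:

```python
import numpy as np

def thresholded_reconstruction(S, A, eps):
    # contributions of every component j to every (time t, stock k) entry:
    # C[t, k, j] = S[t, j] * A[k, j]
    C = S[:, None, :] * A[None, :, :]
    keep = C * C > eps * eps                # keep-large activation, element-wise
    X_maj = np.where(keep, C, 0.0).sum(-1)  # reconstruction from large contributions
    X_min = np.where(keep, 0.0, C).sum(-1)  # reconstruction from the remainder
    rate = keep.mean()                      # fraction of contributions kept
    return X_maj, X_min, rate

rng = np.random.RandomState(0)
S = rng.randn(5, 3)
A = rng.randn(4, 3)
X_maj, X_min, rate = thresholded_reconstruction(S, A, 0.5)
```

By construction the two parts always sum back to the full reconstruction S·Aᵀ, and the keep rate matches the notebook's ratio when the mixing matrix is square.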

In [343]:
#reconstruct X entrywise, applying activation function g to each component's contribution
array_with_g = lambda g,S,A: np.array([[np.sum(np.vectorize(g)(row_S*row_A)) for row_A in A] for row_S in S])
activation = lambda e: lambda i: i if (i*i)>(e*e) else 0      # keep only large contributions
activation_tst = lambda e: lambda i: 1 if (i*i)>(e*e) else 0  # count kept contributions
activation_inv = lambda e: lambda i: 0 if (i*i)>(e*e) else i  # keep only small contributions
def sep_ics_g(S,A,e):
    return (array_with_g(activation(e),S,A),
            array_with_g(activation_inv(e),S,A),
            np.mean(array_with_g(activation_tst(e),S,A))/A.shape[0])
In [344]:
S,A,x = ica_d(d)
X_maj_t, X_min_t, rat = sep_ics_g(S,A,0.01)
rat
Out[344]:
0.028490547623200684
In [345]:
df_maj, df_min = (reconstruct(d,x,base,means,fac) for x in (X_maj_t,X_min_t))
[plot(df,lbls) for df in (df_min,df_maj,d)]
Out[345]:
[None, None, None]
Normalization by columnwise max

We can see that the thresholded reconstruction with a threshold of 0.01 and a keeping rate of ~0.2 looks mostly fine, but it has serious issues with certain stocks (CMG), which drop off considerably. Perhaps at these points there are many small weights across the different components, which would usually cancel out a stronger signal but are all discarded by the threshold?
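This cancellation hypothesis can be illustrated with a toy case (numbers chosen purely for illustration): many sub-threshold contributions that together offset a large one are all dropped, biasing the thresholded sum upward:

```python
import numpy as np

eps = 0.01

# one large contribution plus many small ones that together offset most of it
contributions = np.array([0.5] + [-0.005] * 80)

full = contributions.sum()  # small terms cancel most of the large one
kept = contributions[contributions ** 2 > eps ** 2].sum()  # thresholding drops them all
```

The full sum is about 0.1, while the thresholded sum keeps the entire 0.5, so the reconstruction overshoots wherever such cancellations occur.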

Normalization by global max

We can see that global-max normalization is very good at reconstructing Priceline (PCLN, the biggest stock) and not much else. However, even with a very small keeping ratio, we still see some information in the smaller stocks. Then again, since these are the 8 biggest stocks, they might also be the most resilient ones.

Normalization by variance

PCA

In this section, we analyze what results a PCA approach yields.
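Unlike ICA, PCA orders its components by explained variance, which gives a natural alternative to the contribution-based ranking above. A minimal sketch on synthetic low-rank data (illustrative only, not the stock panel):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# low-rank data: 2 strong latent factors driving 10 correlated series, plus noise
factors = rng.randn(500, 2)
loadings = rng.randn(2, 10)
X = factors @ loadings + 0.01 * rng.randn(500, 10)

pca = PCA()
S = pca.fit_transform(X)

# the first two components should carry nearly all the variance
top2 = pca.explained_variance_ratio_[:2].sum()
```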

In [346]:
pca = PCA()

# compute sources and invert-transformed signal S from dataframe d
def pca_d(d):
    pca_d = pca.fit(d)
    S=pca_d.transform(d)
    return (S, pca_d.components_.T, pca_d.inverse_transform(S))
In [347]:
S,A,x = pca_d(d)
R = ranking(S,A)
X_maj, X_min =sep_ics(S,A,R)
rec_df, min_df = (reconstruct(d,x,base,means,fac) for x in (X_maj,X_min))
[plot(d,lbls) for d in (rec_df,min_df,df)]
Out[347]:
[None, None, None]

Thresholded Reconstruction from PCA

In [350]:
S,A,x = pca_d(d)
X_maj_tp, X_min_tp, rat =sep_ics_g(S,A,0.01)
rec_df, min_df = (reconstruct(d,x,base,means,fac) for x in (X_maj_tp, X_min_tp))
[plot(d,lbls) for d in (rec_df,min_df,df)]
rat
Out[350]:
0.017249237912503219
global max norm

We can see that PCA does rather well at reconstructing with an eps of 0.01 and a keeping ratio of <2%.

variance norm

The variance norm does not work all that well: we have serious issues with the reconstruction even at a keeping rate of 33%.

local max norm

The local max norm also works quite well, giving fairly good reconstructions, though with a mysterious drop-off in CMG at a keeping rate of 21%.

In [273]:
plot_off(to_df(d,S),d.columns[indices(R)[0]].values)

Misc.

A random collection of code snippets and more.

In [490]:
# indices of the n sources with the largest peak absolute value
topsources = lambda S,n: np.argsort(np.apply_along_axis(lambda x: np.max(np.abs(x)),0,S))[-n:]
topsources(S,8)
Out[490]:
array([410, 152, 386, 158, 183, 354,  14, 202])
In [491]:
#show top
py.iplot([go.Scatter(y=S[i]) for i in topsources(S,8)])
In [484]:
np.argsort( np.apply_along_axis(np.max,0,S))
Out[484]:
array([354,  14, 314, 257, 227, 122, 316,  64, 234, 298, 167, 261, 307,
       293, 399, 151, 336, 436, 353, 131, 364,   2, 211, 109, 343, 144,
        34,  85,  81, 290, 240, 352, 394, 169, 415, 347, 411, 440, 337,
       258, 135,   1, 393, 365, 186, 378, 114,  98, 113, 263, 196, 308,
       198, 332,  28, 388, 315, 162,  93,  17, 136,  47,  78, 344, 181,
       310,  35, 164, 271, 387,  49, 161, 236, 346, 155, 260, 304, 292,
       158,  46,  18, 133, 323, 341, 157, 107, 241, 141, 395, 432, 324,
       383,  45, 418, 312,  76,  55, 207, 342,  27, 239, 423,   5, 372,
       246, 160, 278, 249, 380, 357, 385, 194, 129, 424, 402, 428, 140,
       193, 437, 280, 218,  74, 369, 214,  83, 259, 412, 281, 105, 371,
        92,  24, 178, 374, 287, 153, 201, 191, 254, 209,  44, 104, 250,
       272, 244, 302, 205, 226, 138,  30, 128,   9, 318, 300, 309, 130,
        86, 384,  41,  95, 230,  22, 126, 338, 275, 143,  59, 243, 274,
       363, 123, 232, 148, 417,  48, 409, 184, 118, 377, 361,  16,  11,
        50, 320, 190,  19, 416, 267, 283,  60, 245, 367, 220,   6, 266,
       223, 233, 121,  67, 154, 165,  33, 403, 362, 340, 185, 396, 111,
       295, 116, 156, 146, 405, 359,  75, 334, 286, 204, 390, 322, 195,
       262, 317,  82, 427, 398, 313, 197,  15, 291, 335, 431,  69, 333,
       366,  61,  13, 251, 210, 124, 331, 264, 208, 279, 179, 339, 199,
       187, 173, 110, 419, 400,  26,  79,  51,  91, 174,   3, 350,  57,
        12,  43,   8, 430, 434, 277, 276, 225, 426, 319, 255, 303,  84,
       108, 401,  29,  70, 327,  65, 177, 119,  36, 404, 435, 102, 139,
       397, 112, 355, 429,  54, 311, 106, 200,  31,   7, 206, 326,  23,
       420, 373, 192, 147, 382, 252, 132, 219, 242, 376, 370,  58, 356,
       391, 358,  56, 172, 349, 168, 299, 120,  32,  96, 248,  62, 231,
       433,   4, 289, 229,  20, 189, 288, 100, 297, 127,  38, 268, 247,
       351, 217,  37,  68,  21, 150,  97, 125,  10, 235, 149,  52,  77,
       256,  87, 282,  88, 379, 137, 439, 296, 213, 176, 360, 306, 166,
        89, 171, 103, 237,  94, 238, 134,  72, 222, 182, 273, 188, 389,
        40,  25, 228, 175, 421, 221, 329, 345, 413, 215, 159,  53,  71,
        63,  66, 142, 115, 422, 325,  90, 408, 216, 101, 381,  80, 163,
       368, 414, 212,  73, 265, 253, 224,   0, 180, 170, 321, 425, 438,
       305, 375, 269, 270, 117, 145, 348, 406, 330, 284, 301, 392,  42,
       407, 285, 294, 328,  39, 203,  99, 410, 152, 386, 183, 202])